SWE-rebench: the Nebius AI R&D team presents a new dataset for SWE tasks.
The researchers built an automated system that collects and validates thousands of real-world tasks from GitHub for training and evaluating LLMs on software engineering.
Main features of the system:
1️⃣ Automatic data collection: Continuously extracts issue-PR pairs from Python repositories.
2️⃣ LLM-based environment setup: An LLM analyzes each repository, writes install instructions, and revises them when errors occur.
3️⃣ Execution-based validation: Each task is checked by automatically setting up the environment, running the tests, and freezing dependencies so the result is reproducible.
4️⃣ LLM quality annotation: Tasks are labeled for clarity, difficulty, and test correctness to support filtering.
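As a rough illustration of step 2, the propose-run-revise loop could look like the sketch below. This is not the team's implementation: `build_environment`, `propose`, `run`, and the toy stand-ins are hypothetical names, and the real system presumably calls an LLM and runs commands in an isolated environment.

```python
from __future__ import annotations

from typing import Callable, Optional


def build_environment(
    propose: Callable[[Optional[str]], list[str]],
    run: Callable[[list[str]], Optional[str]],
    max_attempts: int = 3,
) -> Optional[list[str]]:
    """Return install commands that ran cleanly, or None if every attempt failed."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        commands = propose(error)  # hypothetical LLM call; sees the previous error, if any
        error = run(commands)      # hypothetical runner; None on success, else error text
        if error is None:
            return commands
    return None


# Toy stand-ins so the sketch runs without an LLM or a container.
def fake_propose(error: Optional[str]) -> list[str]:
    if error is None:
        return ["pip install ."]
    return ["pip install setuptools", "pip install ."]


def fake_run(commands: list[str]) -> Optional[str]:
    if "pip install setuptools" in commands:
        return None
    return "ModuleNotFoundError: setuptools"


result = build_environment(fake_propose, fake_run)
print(result)
```

Here the first proposal fails, the error text is fed back, and the second proposal succeeds. Capping attempts keeps a repository that cannot be installed from blocking the pipeline.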
Result:
SWE-rebench dataset: 21,000+ ready-to-use interactive tasks.
Continuous updates: Fresh data is added regularly.
Transparent evaluation: Tasks are used for the public SWE-rebench leaderboard.
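The quality labels (clarity, difficulty, test correctness) are meant to support filtering. A minimal sketch of such a filter follows. The field names, the 0-3 scales, and the sample records are all assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTask:
    # All field names and scales are hypothetical, chosen for illustration.
    task_id: str
    clarity: int         # 0 = well specified ... 3 = unclear
    difficulty: int      # 0 = trivial ... 3 = hard
    tests_correct: bool  # True if the tests were judged to match the issue


def filter_tasks(tasks, max_clarity=1, max_difficulty=2):
    """Keep tasks that are clear enough, not too hard, and have sound tests."""
    return [
        t for t in tasks
        if t.clarity <= max_clarity
        and t.difficulty <= max_difficulty
        and t.tests_correct
    ]


# Made-up records, only to exercise the filter.
tasks = [
    AnnotatedTask("a", clarity=0, difficulty=1, tests_correct=True),
    AnnotatedTask("b", clarity=3, difficulty=1, tests_correct=True),
    AnnotatedTask("c", clarity=0, difficulty=1, tests_correct=False),
]
kept = [t.task_id for t in filter_tasks(tasks)]
print(kept)
```

With these thresholds only task "a" survives: "b" is too unclear and "c" has tests that were judged incorrect.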
SWE-rebench gives researchers and developers real, validated tasks for working with LLMs in the SWE field.
Technical report: arXiv
Dataset: SWE-rebench
tg-me.com/opendatascience/2331
By Data Science by ODS.ai

